This practical is based on exploratory data analysis and prediction of a dataset derived from a municipal database of healthcare administrative data. This dataset is derived from Vitoria, the capital city of EspÃrito Santo, Brazil (population 1.8 million) and was freely shared under a creative commons license.
Generate an rmarkdown report that contains all the necessary code to document and perform: EDA, prediction of no-shows using XGBoost, and an analysis of variable/feature importance using this data set. Ensure your report includes answers to any questions marked in bold. Please submit your report via brightspace as a link to a git repository containing the rmarkdown and compiled/knitted html version of the notebook.
The Brazilian public health system, known as SUS for Unified Health System in its acronym in Portuguese, is one of the largest health system in the world, representing government investment of more than 9% of GDP. However, its operation is not homogeneous and there are distinct perceptions of quality from citizens in different regions of the country. Non-attendance of medical appointments contributes a significant additional burden on limited medical resources. This analysis will try and investigate possible factors behind non-attendance using an administrative database of appointment data from Vitoria, EspÃrito Santo, Brazil.
The data required is available via the course website.
1 Use the data dictionary describe each of the variables/features in the CSV in your report.
2 Can you think of 3 hypotheses for why someone may be more likely to miss a medical appointment?
3 Can you provide 3 examples of important contextual
information that is missing in this data dictionary and dataset that
could impact your analyses e.g., what type of medical appointment does
each AppointmentID
refer to?
4 Modify the following to make it reproducible i.e., downloads the data file directly from version control
raw.data <- read_csv('lab1_data/2016_05v2_VitoriaAppointmentData.csv', col_types='fffTTifllllflf')
#raw.data <- readr::read_csv('https://raw.githubusercontent.com/maguire-lab/health_data_science_research/ ... ')
Now we need to check data is valid: because we specified col_types and the data parsed without error most of our data seems to at least be formatted as we expect i.e., ages are integers
raw.data %>% filter(Age > 110)
## # A tibble: 5 × 14
## PatientID AppointmentID Gender ScheduledDate AppointmentDate Age
## <fct> <fct> <fct> <dttm> <dttm> <int>
## 1 3196321161… 5700278 F 2016-05-16 09:17:44 2016-05-19 00:00:00 115
## 2 3196321161… 5700279 F 2016-05-16 09:17:44 2016-05-19 00:00:00 115
## 3 3196321161… 5562812 F 2016-04-08 14:29:17 2016-05-16 00:00:00 115
## 4 3196321161… 5744037 F 2016-05-30 09:44:51 2016-05-30 00:00:00 115
## 5 7482345792… 5717451 F 2016-05-19 07:57:56 2016-06-03 00:00:00 115
## # … with 8 more variables: Neighbourhood <fct>, SocialWelfare <lgl>,
## # Hypertension <lgl>, Diabetes <lgl>, AlcoholUseDisorder <lgl>,
## # Disability <fct>, SMSReceived <lgl>, NoShow <fct>
We can see there are 2 patient’s older than 100 which seems suspicious but we can’t actually say if this is impossible.
5 Are there any individuals with impossible ages? If
so we can drop this row using filter
i.e.,
data <- data %>% filter(CRITERIA)
First, we should get an idea if the data meets our expectations,
there are newborns in the data (Age==0
) and we wouldn’t
expect any of these to be diagnosed with Diabetes, Alcohol Use Disorder,
and Hypertension (although in theory it could be possible). We can
easily check this:
raw.data %>% filter(Age == 0) %>% select(Hypertension, Diabetes, AlcoholUseDisorder) %>% unique()
## # A tibble: 1 × 3
## Hypertension Diabetes AlcoholUseDisorder
## <lgl> <lgl> <lgl>
## 1 FALSE FALSE FALSE
We can also explore things like how many different neighborhoods are there and how many appoints are from each?
count(raw.data, Neighbourhood, sort = TRUE)
## # A tibble: 81 × 2
## Neighbourhood n
## <fct> <int>
## 1 JARDIM CAMBURI 7717
## 2 MARIA ORTIZ 5805
## 3 RESISTÊNCIA 4431
## 4 JARDIM DA PENHA 3877
## 5 ITARARÉ 3514
## 6 CENTRO 3334
## 7 TABUAZEIRO 3132
## 8 SANTA MARTHA 3131
## 9 JESUS DE NAZARETH 2853
## 10 BONFIM 2773
## # … with 71 more rows
6 What is the maximum number of appointments from the same patient?
Let’s explore the correlation between variables:
# let's define a plotting function
corplot = function(df){
cor_matrix_raw <- round(cor(df),2)
cor_matrix <- melt(cor_matrix_raw)
#Get triangle of the correlation matrix
#Lower Triangle
get_lower_tri<-function(cor_matrix_raw){
cor_matrix_raw[upper.tri(cor_matrix_raw)] <- NA
return(cor_matrix_raw)
}
# Upper Triangle
get_upper_tri <- function(cor_matrix_raw){
cor_matrix_raw[lower.tri(cor_matrix_raw)]<- NA
return(cor_matrix_raw)
}
upper_tri <- get_upper_tri(cor_matrix_raw)
# Melt the correlation matrix
cor_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap Plot
cor_graph <- ggplot(data = cor_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "darkorchid", high = "orangered", mid = "grey50",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1))+
coord_fixed()+ geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank())+
ggtitle("Correlation Heatmap")+
theme(plot.title = element_text(hjust = 0.5))
cor_graph
}
numeric.data = mutate_all(raw.data, function(x) as.numeric(x))
# Plot Correlation Heatmap
corplot(numeric.data)
Correlation heatmaps are useful for identifying linear relationships
between variables/features. In this case, we are particularly interested
in relationships between NoShow
and any specific
variables.
7 Which parameters most strongly correlate with
missing appointments (NoShow
)?
8 Are there any other variables which strongly correlate with one another?
9 Do you see any issues with PatientID/AppointmentID being included in this plot?
Let’s look at some individual variables and their relationship with
NoShow
.
ggplot(raw.data) +
geom_density(aes(x=Age, fill=NoShow), alpha=0.8) +
ggtitle("Density of Age by Attendence")
There does seem to be a difference in the distribution of ages of people
that miss and don’t miss appointments.
However, the shape of this distribution means the actual correlation is
near 0 in the heatmap above. This highlights the need to look at
individual variables.
Let’s take a closer look at age by breaking it into categories.
raw.data <- raw.data %>% mutate(Age.Range=cut_interval(Age, length=10))
ggplot(raw.data) +
geom_bar(aes(x=Age.Range, fill=NoShow)) +
ggtitle("Amount of No Show across Age Ranges")
ggplot(raw.data) +
geom_bar(aes(x=Age.Range, fill=NoShow), position='fill') +
ggtitle("Proportion of No Show across Age Ranges")
10 How could you be misled if you only plotted 1 of these 2 plots of attendance by age group?
The key takeaway from this is that number of individuals > 90 are very few from plot 1 so probably are very small so unlikely to make much of an impact on the overall distributions. However, other patterns do emerge such as 10-20 age group is nearly twice as likely to miss appointments as the 60-70 years old.
Another interesting finding is the NA
group, they are
the result of trying to assign age of 0 to groups and represent missing
data.
raw.data %>% filter(Age == 0) %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 3539
Next, we’ll have a look at SMSReceived
variable:
ggplot(raw.data) +
geom_bar(aes(x=SMSReceived, fill=NoShow), alpha=0.8) +
ggtitle("Density of SMS received across Age and No Show")
ggplot(raw.data) +
geom_bar(aes(x=SMSReceived, fill=NoShow), position='fill', alpha=0.8) +
ggtitle("Density of SMS received across Age and No Show")
11 From this plot does it look like SMS reminders increase or decrease the chance of someone not attending an appointment? Why might the opposite actually be true (hint: think about biases)?
12 Create a similar plot which compares the the
density of NoShow
across the values of disability
#Insert plot
Now let’s look at the neighbourhood data as location can correlate highly with many social determinants of health.
ggplot(raw.data) +
geom_bar(aes(x=Neighbourhood, fill=NoShow)) +
theme(axis.text.x = element_text(angle=45, hjust=1, size=5)) +
ggtitle('Attendance by Neighbourhood')
ggplot(raw.data) +
geom_bar(aes(x=Neighbourhood, fill=NoShow), position='fill') +
theme(axis.text.x = element_text(angle=45, hjust=1, size=5)) +
ggtitle('Proportional Attendance by Neighbourhood')
Most neighborhoods have similar proportions of no-show but some have much higher and lower rates.
13 Suggest a reason for differences in attendance rates across neighbourhoods.
Now let’s explore the relationship between gender and NoShow.
ggplot(raw.data) +
geom_bar(aes(x=Gender, fill=NoShow))
ggtitle("Gender by attendance")
## $title
## [1] "Gender by attendance"
##
## attr(,"class")
## [1] "labels"
ggplot(raw.data) +
geom_bar(aes(x=Gender, fill=NoShow), position='fill')
ggtitle("Gender by attendance")
## $title
## [1] "Gender by attendance"
##
## attr(,"class")
## [1] "labels"
14 Create a similar plot using
SocialWelfare
#Insert plot
Far more exploration could still be done, including dimensionality reduction approaches but although we have found some patterns there is no major/striking patterns on the data as it currently stands.
However, maybe we can generate some new features/variables that more
strongly relate to the NoShow
.
Let’s begin by seeing if appointments on any day of the week has more
no-show’s. Fortunately, the lubridate
library makes this
quite easy!
raw.data <- raw.data %>% mutate(AppointmentDay = wday(AppointmentDate, label=TRUE, abbr=TRUE),
ScheduledDay = wday(ScheduledDate, label=TRUE, abbr=TRUE))
ggplot(raw.data) +
geom_bar(aes(x=AppointmentDay, fill=NoShow)) +
ggtitle("Amount of No Show across Appointment Day")
ggplot(raw.data) +
geom_bar(aes(x=AppointmentDay, fill=NoShow), position = 'fill') +
ggtitle("Amount of No Show across Appointment Day")
Let’s begin by creating a variable called Lag
, which is the
difference between when an appointment was scheduled and the actual
appointment.
raw.data <- raw.data %>% mutate(Lag.days=difftime(AppointmentDate, ScheduledDate, units = "days"),
Lag.hours=difftime(AppointmentDate, ScheduledDate, units = "hours"))
ggplot(raw.data) +
geom_density(aes(x=Lag.days, fill=NoShow), alpha=0.7)+
ggtitle("Density of Lag (days) by attendance")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
15 Have a look at the values in lag variable, does anything seem odd?
Let’s see how well we can predict NoShow from the data.
We’ll start by preparing the data, followed by splitting it into testing and training set, modeling and finally, evaluating our results. For now we will subsample but please run on full dataset for final execution.
### REMOVE SUBSAMPLING FOR FINAL MODEL
data.prep <- raw.data %>% select(-AppointmentID, -PatientID) #%>% sample_n(10000)
set.seed(42)
data.split <- initial_split(data.prep, prop = 0.7)
train <- training(data.split)
test <- testing(data.split)
Let’s now set the cross validation parameters, and add classProbs so we can use AUC as a metric for xgboost.
fit.control <- trainControl(method="cv",number=3,
classProbs = TRUE, summaryFunction = twoClassSummary)
16 Based on the EDA, how well do you think this is going to work?
Now we can train our XGBoost model
xgb.grid <- expand.grid(eta=c(0.05),
max_depth=c(4),colsample_bytree=1,
subsample=1, nrounds=500, gamma=0, min_child_weight=5)
xgb.model <- train(NoShow ~ .,data=train, method="xgbTree",metric="ROC",
tuneGrid=xgb.grid, trControl=fit.control)
xgb.pred <- predict(xgb.model, newdata=test)
xgb.probs <- predict(xgb.model, newdata=test, type="prob")
test <- test %>% mutate(NoShow.numerical = ifelse(NoShow=="Yes",1,0))
confusionMatrix(xgb.pred, test$NoShow, positive="Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 26385 6390
## Yes 142 242
##
## Accuracy : 0.803
## 95% CI : (0.7987, 0.8073)
## No Information Rate : 0.8
## P-Value [Acc > NIR] : 0.08578
##
## Kappa : 0.0481
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.036490
## Specificity : 0.994647
## Pos Pred Value : 0.630208
## Neg Pred Value : 0.805034
## Prevalence : 0.200006
## Detection Rate : 0.007298
## Detection Prevalence : 0.011581
## Balanced Accuracy : 0.515568
##
## 'Positive' Class : Yes
##
paste("XGBoost Area under ROC Curve: ", round(auc(test$NoShow.numerical, xgb.probs[,2]),3), sep="")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] "XGBoost Area under ROC Curve: 0.74"
This isn’t an unreasonable performance, but let’s look a bit more carefully at the correct and incorrect predictions,
xgb.probs$Actual = test$NoShow.numerical
xgb.probs$ActualClass = test$NoShow
xgb.probs$PredictedClass = xgb.pred
xgb.probs$Match = ifelse(xgb.probs$ActualClass == xgb.probs$PredictedClass,
"Correct","Incorrect")
# [4.8] Plot Accuracy
xgb.probs$Match = factor(xgb.probs$Match,levels=c("Incorrect","Correct"))
ggplot(xgb.probs,aes(x=Yes,y=Actual,color=Match))+
geom_jitter(alpha=0.2,size=0.25)+
scale_color_manual(values=c("grey40","orangered"))+
ggtitle("Visualizing Model Performance", "(Dust Plot)")
Finally, let’s close it off with the variable importance of our model:
results = data.frame(Feature = rownames(varImp(xgb.model)$importance)[1:10],
Importance = varImp(xgb.model)$importance[1:10,])
results$Feature = factor(results$Feature,levels=results$Feature)
# [4.10] Plot Variable Importance
ggplot(results, aes(x=Feature, y=Importance,fill=Importance))+
geom_bar(stat="identity")+
scale_fill_gradient(low="grey20",high="orangered")+
ggtitle("XGBoost Variable Importance")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
17 Using the caret package fit and evaluate 1 other ML model on this data.
18 Based on everything, do you think we can trust analyses based on this dataset? Explain your reasoning.